Introduction to Medical Statistics 2024
Exercise 7 – Multiple linear regression
Data Analysis and Model Diagnostics

Author: Du Hong Duc

Published: August 22, 2024

Exercise i (Multivariable linear regression – Peru lung data set)

The dataset perulung contains data from a study of lung function among 636 children aged 7 to 10 years living in a deprived suburb of Lima, Peru. The outcome of interest is the forced expiratory volume in one second (fev1, in litres), that is, the maximum volume of air a child could breathe out in one second, measured using a spirometer. We aim to predict it from the other covariates. We will use the packages ggplot2, patchwork, gtsummary and ggResidpanel; load these packages first.

  a) Plot the outcome (fev1) against the child’s age and against height. Do the associations look linear?
  b) Fit two separate simple linear regression models: fev1 on age, and fev1 on height.
  c) Fit a multiple linear regression model of fev1 on both age and height. Compare its regression coefficients with those of the simple regression models. Why is the coefficient for age smaller in the multiple regression model? Calculate the 95% confidence intervals for the regression coefficients.
  d) What is the interpretation of the intercept in the model from c)? Centre and rescale the variables so that both the intercept and the regression coefficients are easier to interpret. For example, transform the variables so that the intercept corresponds to a child aged 7 years with a height of 120 cm, and the coefficient for height corresponds to a 10 cm increase in height.
  e) Add sex as an additional covariate to the regression model. How can the coefficient for sex be interpreted? How much of the total variability in fev1 does the model explain?
  f) Produce appropriate diagnostic plots for the model in e) using the ggResidpanel package and the code resid_interact(fit6, plots = "R"). Is there any evidence that the assumptions of the regression model are violated? One individual has fairly extreme residuals in all four plots. Can you find it? What happens if you refit the model without that individual?
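
The steps of this exercise can be sketched in base R as follows. This is only an illustration on simulated data, not on perulung: the data frame sim, its effect sizes, the sex labels and the object names (fit_age, fit_both, fit_raw, fit_scaled, fit_drop) are assumptions made here, and only the column names fev1, age, height and sex follow the exercise text. Instead of the interactive ggResidpanel panel, the sketch uses rstandard() to flag the most extreme residual.

```r
# Simulated stand-in for perulung; every number below is made up for illustration.
set.seed(1)
n <- 200
sim <- data.frame(
  age = runif(n, 7, 10),
  sex = factor(sample(c("girl", "boy"), n, replace = TRUE),
               levels = c("girl", "boy"))  # hypothetical coding, "girl" = reference
)
sim$height <- 120 + 5 * (sim$age - 8.5) + rnorm(n, sd = 5)
sim$fev1 <- 1.5 + 0.05 * (sim$age - 7) + 0.25 * (sim$height - 120) / 10 +
  0.1 * (sim$sex == "boy") + rnorm(n, sd = 0.2)

# Simple versus multiple regression: the age coefficient changes once
# height (which is correlated with age) enters the model.
fit_age  <- lm(fev1 ~ age, data = sim)
fit_both <- lm(fev1 ~ age + height, data = sim)
print(c(simple = coef(fit_age)[["age"]], adjusted = coef(fit_both)[["age"]]))

# Same model on the raw scale and on a centred/rescaled scale.
# Intercept of fit_scaled: reference-sex child aged 7 years, 120 cm tall.
# Height coefficient of fit_scaled: effect of a 10 cm increase in height.
fit_raw    <- lm(fev1 ~ age + height + sex, data = sim)
fit_scaled <- lm(fev1 ~ I(age - 7) + I((height - 120) / 10) + sex, data = sim)

print(coef(fit_scaled))
print(confint(fit_scaled, level = 0.95))   # 95% confidence intervals
print(summary(fit_scaled)$r.squared)       # share of variability explained

# Flag the observation with the largest absolute standardised residual
# and refit the model without it.
worst    <- which.max(abs(rstandard(fit_scaled)))
fit_drop <- update(fit_scaled, data = sim[-worst, ])
print(cbind(all = coef(fit_scaled), dropped = coef(fit_drop)))

# With ggResidpanel installed, the panel named in the exercise would be:
# ggResidpanel::resid_interact(fit_scaled, plots = "R")
```

Centring and rescaling change only the parametrisation: fit_raw and fit_scaled give the same fitted values and the same R², but the coefficients of fit_scaled refer to quantities that are easier to read. With the real data, check how sex is coded before interpreting its coefficient.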

Exercise ii (Multivariable linear regression – Dengue viremia and interactions)

The dataset dengueViremia contains selected data from 121 children with dengue serotype 1 or 2 presenting to a community clinic in Ho Chi Minh City within 3 days of illness onset. In this exercise, we will investigate how the dengue serotype (DENV-1 or DENV-2) and the serology (primary or secondary infection) affect the child’s dengue viremia level on day 3. We will use log10-transformed viremia for all analyses.

  a) Import the dengue viremia dataset and create a boxplot of log10-viremia by type. Is there evidence of an interaction between the effect of serotype and the effect of serology on viremia?
  b) Compare viremia levels between primary and secondary infections with an appropriate test, combining both serotypes. In view of a), does this comparison make sense?
  c) Compare viremia levels between primary and secondary infections separately in the subgroups of patients with DENV-1 and with DENV-2, using appropriate tests.
  d) We want to assess whether dengue serotype and serology affect log10-viremia after controlling for age and gender. Fit a multiple linear regression model for log10-viremia with the covariates age, gender, serotype and serology. What do you conclude?
  e) This model may not be adequate because we already know that there may be an interaction between serotype and serology. Therefore, add an interaction between serotype and serology. Create an article-ready table using the tbl_regression function from the gtsummary package. Interpret the regression output. Use the predict function, or the ggpredict function from the ggeffects package, to obtain the expected value and 95% confidence interval for each of the four serotype-serology combinations. Choose age = 11 and sex = "female" (the default values chosen by ggpredict).
  f) Produce diagnostic plots for the model from e).
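
The interaction model and the predictions for the four serotype-serology combinations might look roughly like the base-R sketch below. It runs on simulated data, not on dengueViremia: the data frame simv, the column name log10_viremia, the factor labels and all effect sizes are assumptions made for illustration. It uses predict(), one of the two options named in the exercise, and does not require gtsummary or ggeffects.

```r
# Simulated stand-in for dengueViremia; all values are made up for illustration.
set.seed(2)
n <- 120
simv <- data.frame(
  age      = sample(5:15, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  serotype = factor(sample(c("DENV-1", "DENV-2"), n, replace = TRUE)),
  serology = factor(sample(c("primary", "secondary"), n, replace = TRUE))
)
# Hypothetical effects, including a serotype-by-serology interaction.
simv$log10_viremia <- 8 -
  1.0 * (simv$serotype == "DENV-2") -
  0.5 * (simv$serology == "secondary") +
  1.5 * (simv$serotype == "DENV-2" & simv$serology == "secondary") +
  rnorm(n, sd = 1)

# serotype * serology expands to both main effects plus their interaction.
fit_int <- lm(log10_viremia ~ age + sex + serotype * serology, data = simv)
print(summary(fit_int)$coefficients)

# One row per serotype-serology combination, at age 11 and sex "female".
grid <- expand.grid(
  age      = 11,
  sex      = "female",
  serotype = levels(simv$serotype),
  serology = levels(simv$serology)
)

# Expected log10-viremia with 95% confidence intervals (columns fit, lwr, upr).
pred <- predict(fit_int, newdata = grid, interval = "confidence", level = 0.95)
print(cbind(grid, pred))
```

In a model of this form, the interaction coefficient equals the difference between the secondary-versus-primary contrast in DENV-2 and the same contrast in DENV-1, which is why the main-effect coefficients alone no longer describe the effect of serology in every serotype.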